Frontiers in Bioinformatics — Latest Matching Preprints

1

Variation in bulk RNA-seq and estimated cell type proportion using deconvolution when comparing pancreatic cancer samples within the same individual

Jansen, R. J.; Munro, S. A.; Antwi, S. O.; Rabe, K. G.; Sicotte, H.

2025-05-06 genetic and genomic medicine 10.1101/2025.05.05.25326976 medRxiv

Top 0.1%

31.3%

Introduction: There is great promise in using genomic data to inform individual cancer treatment plans. Studies assessing intratumor genetic heterogeneity have shown that it may be possible to target biopsies to the tumor subclones driving disease progression or treatment resistance. Here, we explore whether the interpretation of tumor gene expression analysis varies across two specimens from the same patient. Methods: We performed bulk RNA-seq on FFPE samples from 16 patients who also had a separate, earlier bulk RNA-seq analysis deposited in TCGA. We used three different deconvolution methods to compare cell type proportions for these paired data. We normalized study-specific gene expression values per gene by calculating transcripts per million and adjusted for batch effects across studies to compare median expression values. We also compared the reliability of gene expression measurements. We selected KRAS, TP53, SMAD4, and CDKN2A as the most frequently mutated genes in pancreatic cancer, and CTNNB1, JUN, SMAD3, SMAD7, and TCF7 because these tend to be enriched in pancreatic cancer compared with adjacent normal tissue. Results: Average cell type proportion varied the most between studies (i.e., between the two samples for each patient) for NK cells and macrophages (using an adjusted significance threshold of 0.05/21 ≈ 0.002). In the differential expression analysis, we did not observe significant differences in average expression for any of the selected genes. Using a median cut point, we observed substantial concordance across the two studies only for JUN (kappa = 0.75), with low to moderate concordance (kappa 0.25-0.5) for the remaining 8 genes. Discussion: Together, the findings suggest that more than one tumor sample may be needed for effective treatment planning. Any difference in observed expression values across the paired samples could be related to the different cell type proportions across the samples. The sample size was small, and each study used different sequencing technologies, so any interpretation should be confirmed with additional studies.
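
The concordance measure mentioned in this abstract (dichotomizing each gene's expression at the median and computing kappa between two studies) can be sketched as follows. This is only an illustration, not the authors' code: the function names and the expression values are hypothetical, and Cohen's kappa is assumed to be the kappa statistic in question.

```python
from statistics import median


def dichotomize_at_median(values):
    """Label each value 1 if strictly above the sample median, else 0."""
    cut = median(values)
    return [1 if v > cut else 0 for v in values]


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length lists of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    categories = set(labels_a) | set(labels_b)
    # Observed agreement: fraction of samples given the same label twice.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the marginal label frequencies.
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    if expected == 1.0:
        return 1.0  # both label lists use one identical category throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical TPM values for one gene in 8 patients, measured in two studies.
study_1 = [5.1, 7.3, 2.2, 9.8, 4.4, 6.0, 3.3, 8.1]
study_2 = [4.9, 6.8, 2.9, 9.1, 6.5, 5.2, 3.0, 7.7]
kappa = cohen_kappa(
    dichotomize_at_median(study_1), dichotomize_at_median(study_2)
)
```

In this toy example two of the eight patients switch sides of the median between studies, which is enough to pull kappa down to 0.5.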

2

A systematic review on diagnostic and prognostic biomarkers for bladder cancer

Muhammad, U.; Ahmad, U.; Ibrahim, B.; Ahmad, A. A.; LIMAN, H. U.

2024-06-03 genetic and genomic medicine 10.1101/2024.06.02.24308331 medRxiv

Top 0.1%

16.1%

Background: Bladder cancer is one of the most prevalent malignancies worldwide. Despite its high incidence, public awareness of the condition remains low, and it has received less research attention than other common cancers. Over the past 80 years, patient outcomes and treatment strategies have remained largely unchanged, with cystoscopy being the primary method for detecting bladder cancer. This procedure, often repeated during long-term surveillance due to the recurrent nature of bladder tumors, is both uncomfortable for patients and costly for healthcare providers. The identification and validation of molecular biomarkers in blood, urine, or tissue could facilitate tumour detection and reduce reliance on cystoscopy. Aim: This study aims to identify potential molecular biomarkers for bladder cancer that could improve tumour detection and lessen the need for repeated cystoscopies. Methods: A systematic review was conducted, searching for articles related to bladder cancer biomarkers in four databases: PubMed, ScienceDirect, Google Scholar, and Cochrane. Studies that met the inclusion criteria underwent title/abstract screening and full-text review. A total of twenty studies were deemed eligible for inclusion in this review. Results: The review identified several gene product biomarkers, including TEAD4, TPM1, TPM2, SKA3, EO1, HYAL3, MTDH, EPDR1, hTERT, KRT7, SW, ARHGAP9, XPH4, OTX1, BUB1, and Usp28. Additionally, protein product biomarkers were identified, such as A1AT, APOE, AG, CA9, IL8, MMP9, MMP10, PAI1, SCDI1, SDC1, VEGFA, CD73, TIP2, CXCL5, PCAT6, and NCR3LG1 (B7-H6). Conclusion: The study highlights the potential of various gene and protein biomarkers for the detection of bladder cancer. Further research is necessary to validate these biomarkers' diagnostic and prognostic potential in identifying bladder cancer in suspected cases.

3

Network-based integration of gene expression and DNA methylation identifies prognostic biomarkers for early-stage pancreatic cancer

T, D.; Anbarasu, K.; Rama Hebbar, S.; KR, D.; Vasudevan, K.; Rohini, K.

2026-02-11 cancer biology 10.64898/2026.02.09.704985 medRxiv

Top 0.1%

15.2%

Pancreatic ductal adenocarcinoma remains one of the most lethal malignancies, largely due to the absence of reliable early-stage biomarkers. Here, we present a network-based multi-omics framework that integrates gene expression and DNA methylation data through partial correlation analysis to uncover prognostic markers. Four distinct networks were constructed: gene expression co-expression, methylation-only, multiplex (inter-layer connections linking the same genes across omics layers), and monoplex (fused multi-omics). Weighted gene co-expression network analysis (WGCNA) was applied to each network to select non-redundant, topologically representative hub genes as features for machine learning classification. Models trained on cross-layer (multiplex) features achieved an ROC AUC of 82%, compared with 50-60% using single-omics features alone. The genes most strongly associated with poor prognosis include TFCP2L1, DHX32, and NCK1.

4

Get to know your neighbors with a SNAQ™: A framework for single cell spatial neighborhood analysis in immunohistochemical images

Silver, A.; Chakraborty, A.; Pittu, A.; Feier, D.; Anica, M.; West, I.; Sarkisian, M. R.; Deleyrolle, L. P.

2024-08-07 bioinformatics 10.1101/2024.08.04.606539 medRxiv

Top 0.1%

12.0%

Motivation: Analyzing the local microenvironment of tumor cells can provide significant insights into their complex interactions with their cellular surroundings, including immune cells. By quantifying the prevalence and distances of certain immune cells in the vicinity of tumor cells through a neighborhood analysis, patterns may emerge that indicate specific associations between cell populations. Such analyses can reveal important aspects of tumor-immune dynamics, which may inform therapeutic strategies. This method enables an in-depth exploration of spatial interactions among different cell types, which is crucial for research in oncology, immunology, and developmental biology. Results: We introduce an R Markdown script called SNAQ™ (Single-cell Spatial Neighborhood Analysis and Quantification), which conducts a neighborhood analysis on immunofluorescent images without the need for extensive coding knowledge. As a demonstration, SNAQ™ was used to analyze images of pancreatic ductal adenocarcinoma. Samples stained for DAPI, PanCK, CD68, and PD-L1 were segmented and classified using QuPath. The resulting CSV files were exported into RStudio for further analysis and visualization using SNAQ™. Visualizations include plots revealing the cellular composition of neighborhoods around multiple cell types within a customizable radius. Additionally, the analysis includes measuring the distances between cells of certain types relative to others across multiple regions of interest. Availability and implementation: The R Markdown files that comprise the SNAQ™ algorithm and the input data from this paper are freely available on the web at https://github.com/AryehSilver1/SNAQ.
[Visual abstract: Figure 1. Created with BioRender.com.]
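
The two quantities this abstract describes (neighborhood composition within a radius, and distances between cell types) can be illustrated with a minimal Python sketch. This is not SNAQ™ itself, which is an R Markdown script; the function names, the coordinates, and the mapping of markers to cell-type labels are all hypothetical.

```python
from collections import Counter
from math import hypot


def neighborhood_composition(cells, center_type, radius):
    """Count cell types found within `radius` of every `center_type` cell.

    `cells` is a list of (x, y, cell_type) tuples. A center cell is never
    counted as its own neighbor.
    """
    counts = Counter()
    for i, (cx, cy, ctype) in enumerate(cells):
        if ctype != center_type:
            continue
        for j, (x, y, other_type) in enumerate(cells):
            if i != j and hypot(x - cx, y - cy) <= radius:
                counts[other_type] += 1
    return counts


def nearest_distances(cells, from_type, to_type):
    """For each `from_type` cell, distance to the closest other `to_type` cell."""
    distances = []
    for i, (fx, fy, ftype) in enumerate(cells):
        if ftype != from_type:
            continue
        candidates = [
            hypot(x - fx, y - fy)
            for j, (x, y, t) in enumerate(cells)
            if t == to_type and j != i
        ]
        if candidates:  # skip cells that have no target cell at all
            distances.append(min(candidates))
    return distances


# Hypothetical centroids (arbitrary units), e.g. as exported from a
# segmentation tool; the labels are assumptions for illustration only.
cells = [
    (0.0, 0.0, "PanCK+"),
    (3.0, 4.0, "CD68+"),
    (1.0, 0.0, "PanCK+"),
    (10.0, 10.0, "CD68+"),
    (0.0, 2.0, "PD-L1+"),
]
composition = neighborhood_composition(cells, "PanCK+", radius=6.0)
tumor_to_macrophage = nearest_distances(cells, "PanCK+", "CD68+")
```

The nested loop is quadratic in the number of cells; for whole-slide data a spatial index would be the usual replacement, but the brute-force form keeps the definition of a "neighborhood" explicit.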

5

Vorpal: A Novel RNA Virus Feature-Extraction Algorithm Demonstrated Through Interpretable Genotype-to-Phenotype Linear Models

Davis, P.; Bagnoli, J.; Yarmosh, D.; Shteyman, A.; Presser, L.; Altmann, S.; Bradrick, S.; Russell, J. A.

2020-03-02 bioinformatics 10.1101/2020.02.28.969782 medRxiv

Top 0.1%

11.1%

In the analysis of genomic sequence data, so-called "alignment free" approaches are often selected for their relative speed compared to alignment-based approaches, especially in the application of distance comparisons and taxonomic classification [1,2,3,4]. These methods are typically reliant on excising K-length substrings of the input sequence, called K-mers [5]. In the context of machine learning, K-mer based feature vectors have been used in applications ranging from amplicon sequencing classification to predictive modeling for antimicrobial resistance genes [6,7,8]. This can be seen as an analogy of the "bag-of-words" model successfully employed in natural language processing and computer vision for document and image classification [9,10]. Feature extraction techniques from natural language processing have previously been analogized to genomics data [11]; however, the "bag-of-words" approach is brittle in the RNA virus space due to the high intersequence variance and the exact matching requirement of K-mers. To reconcile the simplicity of "bag-of-words" methods with the complications presented by the intrinsic variance of RNA virus space, a method to resolve the fragility of extracted K-mers in a way that faithfully reflects an underlying biological phenomenon was devised. Our algorithm, Vorpal, allows the construction of interpretable linear models with clustered, representative degenerate K-mers as the input vector and, through regularization, sparse predictors of binary phenotypes as the output. Here, we demonstrate the utility of Vorpal by identifying nucleotide-level genomic motif predictors for binary phenotypes in three separate RNA virus clades: human pathogen vs. non-human pathogen in Orthocoronavirinae, hemorrhagic fever causing vs. non-hemorrhagic fever causing in Ebolavirus, and human-host vs. non-human host in Influenza A. The capacity of this approach for in silico identification of hypotheses which can be validated by direct experimentation, as well as identification of genomic targets for preemptive biosurveillance of emerging viruses, is discussed. The code is available for download at https://github.com/mriglobal/vorpal.
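
The plain "bag-of-words" K-mer representation that this abstract calls brittle can be sketched in a few lines. The sketch below shows only that exact-match baseline, not Vorpal's clustering into degenerate K-mers; the function names and the toy sequences are hypothetical.

```python
from collections import Counter


def kmer_counts(sequence, k):
    """Count every overlapping K-length substring of a sequence."""
    seq = sequence.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def feature_vector(sequence, k, vocabulary):
    """Bag-of-K-mers vector: one count per K-mer in a fixed vocabulary."""
    counts = kmer_counts(sequence, k)
    return [counts[kmer] for kmer in vocabulary]


# Two toy sequences that differ by a single substitution (A -> T).
seq_a = "ACGTACGT"
seq_b = "ACGTTCGT"
counts_a = kmer_counts(seq_a, 3)
counts_b = kmer_counts(seq_b, 3)

# Multiset intersection: K-mers that still match exactly after the change.
shared = sum((counts_a & counts_b).values())
total = sum(counts_a.values())

vocabulary = sorted(set(counts_a) | set(counts_b))
vector_a = feature_vector(seq_a, 3, vocabulary)
vector_b = feature_vector(seq_b, 3, vocabulary)
```

Here one substitution leaves only 3 of the 6 3-mers matching exactly, which is the exact-matching fragility the abstract refers to.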

6

Classification of ovarian cancer cell lines using transcriptional profiles defines the five major pathological subtypes

Barnes, B.; Nelson, L.; Tighe, A.; Morgan, R.; McGrail, J.; Taylor, S. S.

2020-07-15 cancer biology 10.1101/2020.07.14.202457 medRxiv

Top 0.1%

10.9%

Epithelial ovarian cancer (EOC) is a heterogeneous disease consisting of five major pathologically distinct subtypes: high-grade serous ovarian carcinoma (HGSOC), low-grade serous (LGS), endometrioid, clear cell and mucinous carcinoma. Although HGSOC is the most prevalent subtype, representing approximately 75% of cases, a landmark 2013 study by Domcke et al. found that many frequently used ovarian cancer cell lines were not genetically representative of HGSOC tissue samples from The Cancer Genome Atlas. Although this work subsequently identified several rarely used cell lines to be highly suitable as HGSOC models, cell line selection for ovarian cancer research does not appear to have altered substantially in recent years. Here, we find that application of non-negative matrix factorisation (NMF) to the transcriptional profiles of 45 commonly used ovarian cancer cell lines exquisitely clusters them into five distinct classes, representative of the five main subtypes of EOC. This methodology was in strong agreement with Domcke et al. in identifying the cell lines most representative of HGSOC. Furthermore, this robust classification of cell lines, including some previously not annotated or mis-annotated in the literature, now informs selection of the most appropriate models for all five pathological subtypes of ovarian cancer. Finally, using machine learning algorithms trained on the classification of the current cell lines, we are able to provide a methodology for future classification of novel EOC cell lines.

7

Protein Language Models Expose Viral Mimicry and Immune Escape

Ofer, D.; Linial, M.

2024-03-15 bioinformatics Community evaluation 10.1101/2024.03.14.585057 medRxiv

Top 0.1%

10.8%

Motivation: Viruses elude the immune system through molecular mimicry, adopting biophysical characteristics of their host. We adapt protein language models (PLMs) to differentiate between human and viral proteins. Understanding where the immune system and our models make mistakes could reveal viral immune escape mechanisms. Results: We applied pretrained deep-learning PLMs to distinguish viral from human proteins. Our predictors show state-of-the-art results with an AUC of 99.7%. We use interpretable error analysis models to characterize viral escapers. Altogether, mistakes account for 3.9% of the sequences, with viral proteins being disproportionately misclassified. Analysis of external variables, including taxonomy and functional annotations, indicated that errors typically involve proteins with low immunogenic potential, viruses specific to human hosts, and those using reverse-transcriptase enzymes for their replication. Viral families causing chronic infections and immune evasion are further enriched and their protein mimicry potential is discussed. We provide insights into viral adaptation strategies and highlight the combined potential of PLMs and explainable AI in uncovering mechanisms of viral immune escape, contributing to vaccine design and antiviral research. Availability and implementation: Data and results are available at https://github.com/ddofer/ProteinHumVir. Contact: michall@cc.huji.ac.il

8

CoMPHI: A Novel Composite Machine Learning Approach Utilizing Multiple Feature Representation to Predict Hosts of Bacteriophages

Bodaka, S.; Malgonde, O.

2024-08-02 bioinformatics 10.1101/2024.07.29.604684 medRxiv

Top 0.1%

10.6%

Phage therapy has reemerged as a compelling alternative to antibiotics in treating bacterial infections, especially for superbugs that have developed antibiotic resistance. The challenge in the broader application of phage therapy is identifying host targets for the vast array of uncharacterized phages obtained through next-generation sequencing. To address this issue, this paper introduces an innovative Composite Model for Phage Host Interaction, CoMPHI, which predicts phage-host interactions by combining the accuracy of alignment-based methods with the efficiency and flexibility of machine learning techniques. The model initially generates multiple feature encodings from nucleotide and protein sequences of both phages and hosts to enhance prediction accuracy. It is further enriched by incorporating phage-phage, phage-host, and host-host alignment scores, creating a composite model. During 5-fold cross-validation, the composite model exhibited an Area Under the ROC Curve (AUC) of 94%, 96.4%, 96.5%, 96.6%, 96.6%, and 96.7% and accuracy of 92.3%, 93.3%, 93.6%, 94%, 94.9%, and 95.1% at the Species, Genus, Family, Order, Class, and Phylum levels, respectively. A comparative analysis revealed a 6-8% increase in model performance due to the inclusion of alignment scores. Additionally, an ablation study highlighted that including both nucleotide and protein sequences from both phages and hosts increased the prediction accuracy of the model. Another ablation study provided evidence that phage-host and host-host alignment scores, combined with phage-phage scores, equally contributed to enhancing the composite model's performance. In conclusion, this paper presents a robust and comprehensive composite model advancing the use of phage therapy in modern medicine.

9

Machine Learning Approach to Integrate and Analyse Multiomics data to Identify Actionable Biomarkers for Head and Neck Squamous Cell Carcinoma (HNSCC)

Panchal, K.; Arockia Rajesh Packiam, K.; MAJUMDAR, S.

2025-10-13 genetic and genomic medicine 10.1101/2025.10.09.25335922 medRxiv

Top 0.1%

10.5%

Head and neck squamous cell carcinoma (HNSCC) is ranked sixth among the most common cancers worldwide and is a major cause of death. A molecular understanding of disease progression can aid in timely diagnosis and therapy. This study aims to identify potential HNSCC biomarkers using a machine learning-based approach to integrate and analyse multi-omics data (namely, publicly available multi-omics datasets from human papillomavirus (HPV)-negative patients in the CPTAC-HNSCC project, including transcriptomics, methylomics, proteomics, and phosphoproteomics). A three-step feature selection method was utilized to identify potential molecular biomarkers using machine learning algorithms. The top 1000 important features (genes) were filtered using Mutual Information, followed by a random forest-based feature importance ranking, and Recursive Feature Elimination with cross-validation coupled with a Support Vector Machine (SVM-RFECV) to obtain a minimal gene set important for the machine learning-based tumor-normal classification task. To benchmark these top-selected features, Logistic Regression (LogR), Random Forest (RF), Multi-layer Perceptron (MLP), and Support Vector Machine (SVC) classifiers were used. The prediction performance of classifiers trained on these selected gene sets was evaluated using the accuracy metric, which was then compared against that of models trained on randomly selected gene sets. The entire workflow was repeated 100 times with different random states to establish statistical confidence in the pipeline and the selected gene set. Our integrative approach identified both omics-specific and cross-omics candidate genes with very high classification accuracy, ranging from approximately 95% to 100%. These genes reveal convergent biological processes central to HNSCC pathogenesis, which reinforces the robustness of the methodology. The methodology can be adopted to analyse similar multi-omics datasets for other pathologies and foundational biological questions.

10

A Fusion-Based Multiomics Classification Approach for Enhanced Gene Discovery in Non-Small Cell Lung Cancer

Dwivedi, K.; Mahbod, A.; Ecker, R. C.; Janjic, K.

2025-05-05 oncology 10.1101/2025.05.02.25326847 medRxiv

Top 0.1%

10.4%

This study introduces a fusion-based multiomics approach to identifying non-small cell lung cancer (NSCLC)-relevant genes. We evaluated the NSCLC-subtype classification performance of various state-of-the-art machine learning models using single omics and fused multiomics approaches. The models were trained separately on individual omics data sets. Subsequently, a weighted-average-based decision-level fusion mechanism was employed to integrate the individual predictions of the trained models. Finally, the prediction performance across all the approaches was compared. The decision-level fusion-based approach yielded superior classification performance compared with models trained on individual omics data sets. A set of 47 NSCLC-relevant genes was identified. For the first time, ABCF3, ACAP2, LSG1, TBCCD1, UCN2, WDR53, ZNF639 and FYTTD1 appeared in the context of NSCLC. In conclusion, the integration of multiple omics types showed potential to deliver a more concise selection of NSCLC-relevant genes that could be clinically targeted in future. [Graphical abstract: Figure 1.]
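
Weighted-average decision-level fusion, as named in this abstract, can be sketched as below. This is an illustration under stated assumptions rather than the authors' implementation: the omics layer names, the class labels (LUAD/LUSC), the probabilities, and the weights are all hypothetical, and the abstract does not say how the weights were chosen.

```python
def fuse_predictions(per_model_probs, weights):
    """Weighted average of class-probability vectors from several models.

    `per_model_probs` maps a model name to its list of class probabilities;
    `weights` maps the same names to non-negative weights.
    """
    if set(per_model_probs) != set(weights):
        raise ValueError("every model needs exactly one weight")
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    n_classes = len(next(iter(per_model_probs.values())))
    fused = [0.0] * n_classes
    for name, probs in per_model_probs.items():
        if len(probs) != n_classes:
            raise ValueError("all models must score the same classes")
        for c, p in enumerate(probs):
            fused[c] += weights[name] * p / total
    return fused


def predict_class(fused, class_names):
    """Return the class with the highest fused probability."""
    best = max(range(len(fused)), key=fused.__getitem__)
    return class_names[best]


# Hypothetical per-omics model outputs for a single sample.
class_names = ["LUAD", "LUSC"]
per_model_probs = {
    "expression": [0.60, 0.40],
    "methylation": [0.30, 0.70],
    "mirna": [0.45, 0.55],
}
# Hypothetical weights, e.g. derived from each model's validation score.
weights = {"expression": 0.5, "methylation": 0.3, "mirna": 0.2}

fused = fuse_predictions(per_model_probs, weights)
label = predict_class(fused, class_names)
```

In this toy case the expression model alone would predict LUAD, but the fused vote favors LUSC, which shows how decision-level fusion can overturn a single-omics prediction.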

11

A novel artificial lung organoid for simulating a patient derived adenocarcinoma of lung for personalized oncology.

Esmail, S.; Danter, W. R.

2021-04-27 oncology 10.1101/2021.04.20.21255803 medRxiv

Top 0.1%

9.9%

Optimizing patient care based on precision oncology will inevitably become the standard of care. If we accept the principle that every person's cancer is different, then the most effective therapies will have to be designed for the individual patient and for their tumor's genetic profile. Access to tumor mutational profiling is now widely available but continues to be limited by cost and actionable information. For example, novel combinations of approved drugs are rarely considered. These considerations led us to hypothesize that artificially induced Lung Adenocarcinoma (LUAD)-derived lung organoids could provide a novel, alternative approach for LUAD disease modeling and large-scale targeted drug screening. In this project, we used data from a commercially available tumor mutation profile to generate and then validate the artificially induced LUAD-derived lung organoid simulations (aiLUNG-LUAD) to model LUAD and identify several drug combinations that effectively reverse the tumor's genotypic and phenotypic features when compared with placebo. These results complement previous LUAD-derived lung organoid research and provide a novel and widely applicable cancer drug-screening approach for precision/individualized oncology.

12

3D IntelliGenes: AI/ML application using multi-omics data for biomarker discovery and disease prediction with multi-dimensional visualization

Narayanan, R.; Peker, E.; Degroat, W.; Mendhe, D.; Zeeshan, S.; Ahmed, Z.

2025-03-26 genetic and genomic medicine 10.1101/2025.03.25.25324634 medRxiv

Top 0.1%

9.7%

Background: Cutting-edge AI/ML techniques have proven effective at uncovering knowledge about disease-causing biomarkers and the biological underpinnings of a plethora of human diseases. However, the high-dimensional nature of multi-omics data presents numerous challenges in its effective presentation, annotation, and interpretation. Traditional 2D visualizations often fall short in capturing the intricate relationships between multi-omics features, hindering our ability to identify meaningful correlations. Methods: In this study, we focused on addressing such challenges by developing an innovative solution to better visualize results produced by AI/ML approaches on integrated clinical and multi-omics data for novel biomarker discovery and predictive analysis. We present an advanced version of our earlier published software with intuitive and interactive visualizations of multi-omics data in multiple dimensions, i.e., 3D IntelliGenes, which offers deeper insights, most importantly by capturing greater variability in the patient data by understanding both linear and non-linear structures, evaluating AI/ML model performance, and delineating the joint impact of biomarkers on the corresponding disease states. Results: The overall functionality of 3D IntelliGenes is divided into two modules, data clustering and feature plotting. The data clustering module creates configurable 3D scatter plots to visualize the structure-preserving distribution of disease states, AI/ML classifier bias in the form of type I/II errors, and patient similarity through a robust density-driven clustering algorithm. The feature plotting module, in contrast, supports the joint analysis of pairs of multi-omics features to analyze the interdependence and discriminative power of co-expressed biomarkers. Conclusion: We report the evaluated performance of 3D IntelliGenes using diverse cohorts of patients with cardiovascular and other diseases.

13

Benchmarking of AlphaFold2 accuracy self-estimates as empirical quality measures and model ranking indicators and their comparison with independent model quality assessment programs.

Edmunds, N. S.; McGuffin, L. J.; Genc, A. G.

2023-12-15 bioinformatics 10.1101/2023.12.15.571846 medRxiv

Top 0.1%

9.7%

Motivation: Despite an increase in the accuracy of predicted protein structures following the development of AlphaFold2, there remains a gap in the accuracy of predicted model quality assessment scores when compared to those generated with reference to experimental structures. The predictions of model accuracy scores generated by AlphaFold2, plDDT and pTM, have become familiar descriptors of model quality. However, at CASP15 some modelling groups noticed a variation in these scores for models of very similar observed quality, particularly for quaternary structures. There have also been a number of methods describing adaptations of the AlphaFold2 algorithm to purposes such as refinement by custom template recycling and model quality assessment using a similar method of template input. In this study we compare plDDT and pTM to their observed counterparts lDDT (including lDDT-C and lDDT-oligo) and TM-score to examine whether they retain their reliability across the whole scoring range for both tertiary and quaternary structures and in situations where the AlphaFold2 algorithm is adapted to customised functionality. In addition, we explore the accuracy with which plDDT and pTM rank AlphaFold2 tertiary and quaternary models and whether these can be improved by the independent model quality assessment programs ModFOLD9 and ModFOLDdock. Results: For tertiary structures it was found that plDDT was an accurate descriptor of model quality when compared to observed lDDT-C scores (Pearson ρ = 0.97). Additionally, plDDT achieved a tertiary structure ranking agreement with observed scores of 0.34 as measured by true positive rate (TPR) and ModFOLD9 offered similar but not improved performance. However, the accuracy of plDDT (Pearson ρ = 0.67) and pTM (Pearson ρ = 0.70) became more variable for quaternary structure quality assessment, where overprediction was seen with both scores for models of lower quality and underprediction was also seen with pTM for models of higher quality. Importantly, ModFOLDdock was able to improve upon AF2-Multimer quaternary structure model ranking as measured by both TM-score (TPR 0.34) and lDDT-oligo (TPR 0.43). Finally, evidence is presented for an increase in variability of both plDDT and pTM when custom template recycling is used, and that this variation is more pronounced for quaternary structures.
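
The comparison between predicted and observed quality scores reported in this abstract rests on the Pearson correlation, and over- or underprediction can be summarized by a signed error. The sketch below shows both on made-up numbers; the function names and score values are hypothetical and are not taken from the study.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return sxy / sqrt(sxx * syy)


def mean_signed_error(predicted, observed):
    """Positive values indicate overprediction, negative underprediction."""
    return sum(p - o for p, o in zip(predicted, observed)) / len(predicted)


# Hypothetical self-estimated vs. observed scores on a common 0-1 scale.
predicted = [0.92, 0.85, 0.70, 0.55, 0.40]
observed = [0.90, 0.80, 0.66, 0.38, 0.21]

r = pearson(predicted, observed)
bias = mean_signed_error(predicted, observed)
```

A high correlation and a positive bias can coexist, so the two numbers answer different questions: whether the self-estimate tracks quality, and whether it is systematically optimistic.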

14

TCR-H: Machine Learning Prediction of T-cell Receptor Epitope Binding on Unseen Datasets

Tatikonda, R. R.; Demerdash, O.; Smith, J. C.

2023-12-15 immunology 10.1101/2023.11.28.569077 medRxiv

Top 0.1%

9.6%

AI/ML approaches to predicting T-cell receptor (TCR) epitope specificity achieve high performance metrics on test datasets that include sequences also present in the training set, but fail to generalize to test sets consisting of epitopes and TCRs that are absent from the training set, i.e., unseen. We present TCR-H, a supervised support vector machine classification model that uses physicochemical features and is trained on the largest dataset available to date, using only experimentally validated non-binders as negative datapoints. TCR-H exhibits an area under the curve of the receiver-operator characteristic (AUC of ROC) of 0.87 for epitope hard splitting (i.e., on test sets with all epitopes unseen), 0.92 for TCR hard splitting and 0.89 for strict splitting, in which neither the epitopes nor the TCRs in the test set are seen in the training data. TCR-H may thus represent a significant step towards general applicability of epitope:TCR specificity prediction.
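
The splitting schemes this abstract distinguishes can be made concrete with a small sketch. It is one plausible way to build such splits, not the TCR-H procedure: the function names and the placeholder identifiers are hypothetical, and the choice to discard pairs that share only one side with the test set is an assumption.

```python
def epitope_hard_split(pairs, test_epitopes):
    """Split (tcr, epitope, label) pairs so that test epitopes are unseen."""
    held_out = set(test_epitopes)
    train = [p for p in pairs if p[1] not in held_out]
    test = [p for p in pairs if p[1] in held_out]
    return train, test


def strict_split(pairs, test_epitopes, test_tcrs):
    """Split so that neither test epitopes nor test TCRs occur in training.

    Pairs that share exactly one side with the test set are dropped
    (an assumption made here to keep both sets clean).
    """
    held_epitopes = set(test_epitopes)
    held_tcrs = set(test_tcrs)
    test = [p for p in pairs if p[1] in held_epitopes and p[0] in held_tcrs]
    train = [
        p for p in pairs if p[1] not in held_epitopes and p[0] not in held_tcrs
    ]
    return train, test


# Placeholder identifiers standing in for real sequences; 1 = binder.
pairs = [
    ("TCR_A", "EPI_1", 1),
    ("TCR_B", "EPI_1", 0),
    ("TCR_A", "EPI_2", 0),
    ("TCR_C", "EPI_2", 1),
    ("TCR_C", "EPI_3", 1),
]

hard_train, hard_test = epitope_hard_split(pairs, {"EPI_2"})
strict_train, strict_test = strict_split(pairs, {"EPI_2"}, {"TCR_C"})
```

Under the epitope hard split, TCR_A still appears on both sides; only the strict split removes that overlap, at the cost of discarding two of the five pairs.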

15

Constraint-based modelling predicts metabolic signatures of low- and high-grade serous ovarian cancer

Meeson, K.; Schwartz, J.-M.

2023-03-12 bioinformatics 10.1101/2023.03.09.531870 medRxiv

Top 0.1%

9.6%

Ovarian cancer is an aggressive, heterogeneous disease, burdened with late diagnosis and resistance to chemotherapy. Clinical features of ovarian cancer could be explained by investigating its metabolism, and how the regulation of specific pathways links to individual phenotypes. Ovarian cancer is of particular interest for metabolic research due to its heterogeneous nature, with five distinct subtypes having been identified, each of which may display a unique metabolic signature. To elucidate metabolic differences, constraint-based modeling (CBM) represents a powerful technology, inviting the integration of omics data, such as transcriptomics. However, many CBM methods have not prioritised accurate growth rate predictions, and there are very few ovarian cancer genome-scale studies, thus highlighting a niche in disease research. Here, a novel method for constraint-based modeling has been developed, employing the genome-scale model Human1 and flux balance analysis (FBA), enabling the integration of in vitro growth rates, transcriptomics data and media conditions to predict the metabolic behaviour of cells. Using low- and high-grade ovarian cancer as a case study, subtype-specific metabolic differences have been predicted, which have been supported with CRISPR-Cas9 data and an extensive literature review. Metabolic drivers of aggressive phenotypes, as well as pathways responsible for increased proliferation and chemoresistance in low-grade cell lines, have been suggested. Experimental gene dependency data have been used to validate fatty acid biosynthesis and the pentose phosphate pathway as essential for low-grade cellular growth, highlighting potential vulnerabilities for this ovarian cancer subtype.

16

An Integrated Deep Learning Framework for Small-Sample Biomedical Data Classification: Explainable Graph Neural Networks with Data Augmentation for RNA sequencing Dataset

Guler, F.; Goksuluk, D.; Xu, M.; Choudhary, G.; agraz, m.

2026-02-24 genetic and genomic medicine 10.64898/2026.02.22.26346827 medRxiv

Top 0.1%

9.6%

Applying deep learning models to RNA-Seq data poses substantial challenges, primarily due to the high dimensionality of the data and the limited sample sizes. To address these issues, this study introduces an advanced deep learning pipeline that integrates feature engineering with data augmentation. The engineering application focuses on biomedical engineering, specifically the classification of RNA-Seq datasets for disease diagnosis. The proposed framework was initially validated on synthetic datasets generated from Naive Bayes, where MLP-based augmentation yielded a notable improvement in predictive performance. Building on this foundation, we applied the approach to chromophobe renal cell carcinoma (KICH) RNA-Seq data from The Cancer Genome Atlas (TCGA). Following standard preprocessing steps (normalization, transformation, and dimensionality reduction), the analysis concentrated on three main aspects: augmentation strategies, preprocessing methods, and explainable AI (XAI) techniques in relation to classification outcomes. Feature selection was performed through PCA, Boruta, and RF-based methods. Three augmentation strategies (linear interpolation, SMOTE, and MixUp) were evaluated. To maintain methodological rigor, augmentation was applied exclusively to the training set, while the test set was held out for unbiased evaluation. Within this framework, we conducted a comparative assessment of multiple deep learning architectures, including MLP, GNN, and the recently proposed Kolmogorov-Arnold networks (KAN). The GNN achieved the highest classification accuracy (99.47%) and the best F1 score (0.9948) when trained with MixUp augmentation combined with RF feature selection. Consequently, the GNN-based XAI framework was applied to the RF dataset enriched with MixUp. XAI analyses identified the top 20 most influential genes, such as HNF4A, DACH2, MAPK15, and NAT2, which played the greatest role in classification, thereby confirming the biological plausibility of the model outputs. To further validate model robustness, cervical cancer and Alzheimer's RNA-Seq datasets were also tested, yielding consistent and reliable results. Overall, the findings highlight the value of incorporating data augmentation into deep learning models for RNA-Seq analysis, not only to improve predictive performance but also to enhance biological interpretability through explainable AI approaches.
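
MixUp augmentation restricted to the training set, as described in this abstract, can be sketched as follows. This is a generic version of MixUp (convex combinations of random sample pairs with a Beta-distributed mixing weight), not the authors' pipeline: the function name, the `alpha` value, the seed, and the toy data are assumptions.

```python
import random


def mixup(features, labels, n_new, alpha=0.2, seed=0):
    """Create `n_new` synthetic samples by mixing random pairs of samples.

    `features` is a list of numeric vectors and `labels` a list of one-hot
    vectors. Each synthetic sample is lam * a + (1 - lam) * b, with
    lam drawn from Beta(alpha, alpha).
    """
    rng = random.Random(seed)
    new_features, new_labels = [], []
    for _ in range(n_new):
        i = rng.randrange(len(features))
        j = rng.randrange(len(features))
        lam = rng.betavariate(alpha, alpha)
        new_features.append(
            [lam * a + (1 - lam) * b for a, b in zip(features[i], features[j])]
        )
        new_labels.append(
            [lam * a + (1 - lam) * b for a, b in zip(labels[i], labels[j])]
        )
    return new_features, new_labels


# Toy data: the split is made first and only the training part is augmented.
train_x = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
train_y = [[1, 0], [0, 1], [1, 0]]
test_x = [[2.0, 3.0]]  # held out; never passed to mixup()
test_y = [[0, 1]]

synthetic_x, synthetic_y = mixup(train_x, train_y, n_new=4)
augmented_x = train_x + synthetic_x
augmented_y = train_y + synthetic_y
```

Splitting before augmenting matters because a synthetic sample mixed from a test sample would leak test information into training.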

17

Bioschemas Training Profiles: A set of specifications for standardizing training information to facilitate the discovery of training programs and resources

Jael Castro, L.; Palagi, P. M.; Beard, N.; Bioschemas Training Profiles Group Members, ; ELIXIR FAIR Training Focus Group, ; The GOBLET Foundation, ; Attwood, T. K.; Brazas, M. D.

2022-11-29 scientific communication and education 10.1101/2022.11.24.516513 medRxiv

Top 0.1%

9.5%

Stand-alone life science training events and e-learning solutions are amongst the most sought-after modes of training because they address both point-of-need learning and the limited timeframes available for upskilling. Yet, finding relevant life sciences training courses and materials is challenging because such resources are not marked up for Internet searches in a consistent way. This absence of mark-up standards to facilitate discovery, re-use and aggregation of training resources limits their usefulness and knowledge-translation potential. Through a joint effort between the Global Organisation for Bioinformatics Learning, Education and Training (GOBLET), the Bioschemas Training community and the ELIXIR FAIR Training Focus Group, a set of Bioschemas Training profiles has been developed, published and implemented for life sciences training courses and materials. Here, we describe our development approach and methods, which were based on the Bioschemas model, and present the results for the three Bioschemas Training profiles: TrainingMaterial, Course and CourseInstance. Several implementation challenges were encountered, which we discuss alongside potential solutions. Over time, continued implementation of these Bioschemas Training profiles by training providers will lower the barriers to skill development, facilitating both the discovery of relevant training events to meet individuals' learning needs, and the discovery and re-use of training and instructional materials.
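
To make the idea of consistent mark-up concrete, the following sketch builds a JSON-LD description of a course with one scheduled instance. It uses only schema.org types and properties (`Course`, `CourseInstance`, `hasCourseInstance`, `courseMode`, `startDate`, `endDate`); the course itself and all of its values are hypothetical, and the published Bioschemas Training profiles should be consulted for the properties they actually require or recommend.

```python
import json

# Hypothetical course record; every value below is made up for illustration.
course = {
    "@context": "https://schema.org/",
    "@type": "Course",
    "name": "Introduction to RNA-Seq data analysis",
    "description": "A two-day hands-on course covering quality control, "
                   "alignment and differential expression analysis.",
    "keywords": ["RNA-Seq", "transcriptomics", "bioinformatics"],
    "hasCourseInstance": {
        "@type": "CourseInstance",
        "courseMode": "online",
        "startDate": "2025-01-13",
        "endDate": "2025-01-14",
    },
}

# Serialized JSON-LD of this kind is typically embedded in the course web page
# inside a <script type="application/ld+json"> element so crawlers can read it.
markup = json.dumps(course, indent=2)
print(markup)
```

Separating the course from its scheduled instance mirrors the split between the Course and CourseInstance profiles named in the abstract: the former describes the recurring content, the latter a specific delivery with dates and a mode.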

18

Nuclear Irregularity as a Universal Diagnostic Tool in Solid Tumors

Hamilton, F.; Foster, K.

2025-08-15 bioinformatics 10.1101/2025.08.12.669986 medRxiv

Top 0.1%

9.1%

As tumors develop, cancer cells accumulate diverse genomic and phenotypic alterations to meet heightened demands for energy production and biosynthesis. Loss of lamina function and perturbations in energy production are associated with pronounced aberrations in cellular morphology, particularly within nuclear architecture and the plasma membrane. Systematic analysis of nuclear morphology can reveal conserved structures across diverse cancer types, enabling disease state stratification, biomarker discovery, and potential avenues for personalizing therapy to minimize recurrence risk. To this end, this study analyzes an imaging mass cytometry (IMC) breast cancer dataset, differentiating cancerous and non-cancerous nuclei with a p-value of 1.02e-06. In addition, this study achieves an accuracy of 78 percent using a computational and machine learning-based pipeline for analyzing the morphological heterogeneity of nuclei and protein expression, enabling characterization of patient-specific tumor phenotypes. Unlike traditional morphology analysis pipelines limited to specific imaging platforms, this workflow enables cross-cohort and cross-cancer comparison, capturing tumor-specific phenotypic deviations at single-cell resolution. The resulting phenotypic profiles could inform prognosis, treatment, and monitoring of therapeutic response. Summary: As cancers become more aggressive and require more energy, typically uniform and organized cells begin to develop abnormal features to support their heightened needs. Studies have found that the prevalence of abnormal features is directly associated not only with the speed at which the tumor grows but also with the body's ability to fight back. This study aims to streamline the analysis of these irregular features across all cancer types, providing a clearer picture of how nuclei distinguish stages of cancer and aiding rapid clinical assessment of at-risk or affected patients.
Using the nuclear abnormality score developed here, this study was able to identify sub-populations of highly irregular cancer cells and successfully separate them with a p-value of 1.222e-21. By comparing the expression of cancer proteins with these irregularities, we can begin to develop insights that can be used across all imaging techniques to understand the cancer's inner workings and learn to predict relapse before it even occurs.

19

BioModelsML: Building a FAIR and reproducible collection of machine learning models in life sciences and medicine for easy reuse

Tiwari, D. D.; Hoffmann, N.; Didi, K.; Deshpande, S.; Ghosh, S.; Nguyen, T. V. N.; Raman, K.; Hermjakob, H.; Malik Sheriff, R. S.

2023-05-23 bioinformatics 10.1101/2023.05.22.540599 medRxiv

Top 0.1%

9.0%

Machine learning (ML) models are widely used in life sciences and medicine; however, they are scattered across various platforms, and several challenges hinder their accessibility, reproducibility and reuse. In this manuscript, we present the formalisation and pilot implementation of a community protocol to enable FAIReR (Findable, Accessible, Interoperable, Reusable, and Reproducible) sharing of ML models. The protocol consists of eight steps, including sharing model training code, dataset information, reproduced figures, model evaluation metrics, trained models, Dockerfiles, model metadata, and FAIR dissemination. By applying these measures, we aim to build and share a comprehensive public collection of FAIR ML models in the BioModels repository through incentivized community curation. In a pilot implementation, we curated diverse ML models to demonstrate the feasibility of our approach and discussed the current challenges. Building a FAIReR collection of ML models will directly enhance the reproducibility and reusability of ML models, minimising the effort needed to reimplement models, maximising the impact on the application and significantly accelerating the advancement in the field of life science and medicine.

20

A 2D convolutional neural network for taxonomic classification applied to viruses in the phylum Cressdnaviricota

Gomes, R. A. L.; Zerbini, F. M.

2023-05-02 bioinformatics 10.1101/2023.05.01.538983 medRxiv

Top 0.1%

8.8%

Taxonomy, defined as the classification of different objects/organisms into defined stable hierarchical categories (taxa), is fundamental for proper scientific communication. In virology, taxonomic assignments based on sequence alone are now possible and their use may contribute to a more precise and comprehensive framework. The current major challenge is to develop tools for the automated classification of the millions of putative new viruses discovered in metagenomic studies. Among the many tools that have been proposed, those applying machine learning (ML), mainly in the deep learning branch, stand out with highly accurate results. One recently released ML tool that uses k-mers, VirusTaxo, was the first to be applied successfully (93% average accuracy) to all types of viruses. Nevertheless, there is a demand for new tools that are less computationally intensive. Viruses classified in the phylum Cressdnaviricota, with their small and compact genomes, are good subjects for testing these new tools. Here we tested the use of 2D convolutional neural networks for the taxonomic classification of cressdnaviricots, also testing the effect of data imbalance and two augmentation techniques by benchmarking against VirusTaxo. We obtained perfect classification during k-fold test evaluations for balanced taxa, and more than 98% accuracy in the final pipeline tested for imbalanced datasets. Applying augmentation to the more imbalanced groups and no augmentation to the more balanced ones achieved the best score in the final test. These results indicate that these architectures can classify DNA sequences with high precision.
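
For readers unfamiliar with the k-mers mentioned above, the sketch below shows one generic way to turn a DNA sequence into overlapping k-mer counts and a fixed-length frequency vector. It is not VirusTaxo's implementation, nor the input encoding of the 2D network described in the abstract; the function names, the choice of `k`, and the handling of ambiguous bases are assumptions made for illustration.

```python
from collections import Counter
from itertools import product

def kmer_counts(sequence, k=3):
    """Count overlapping k-mers, skipping windows that contain non-ACGT characters."""
    sequence = sequence.upper()
    counts = Counter()
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        if set(kmer) <= set("ACGT"):
            counts[kmer] += 1
    return counts

def kmer_vector(sequence, k=3):
    """Return k-mer frequencies as a list of length 4**k in lexicographic k-mer order."""
    counts = kmer_counts(sequence, k)
    total = sum(counts.values())
    vocabulary = ["".join(p) for p in product("ACGT", repeat=k)]
    return [counts[kmer] / total if total else 0.0 for kmer in vocabulary]

# Toy sequence; a real genome would be read from a FASTA file.
counts = kmer_counts("ACGTACGT", k=2)
vector = kmer_vector("ACGTACGT", k=2)
```

Because the vector length depends only on `k` (16 entries for `k=2`, 64 for `k=3`), genomes of different lengths map to inputs of the same size, which is what makes k-mer profiles convenient features for classifiers.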